I have recently been reading “Mathletics” by Wayne Winston (http://waynewinston.com/wordpress/?page_id=13). It’s a neat book in which this sports enthusiast and math professor walks the reader through how to calculate almost any statistic for baseball, basketball, and football. I found his chapters on basketball and the NBA very fun to read, and I wanted to test his ideas on predicting wins for NBA teams using only information from the box score. He asserts that you can build a wins model from about 8 variables, covering everything from the team’s shooting effectiveness to the opponent’s effectiveness at getting to the free throw line.
To create the model, Winston compares a team’s abilities against those of its opponents (e.g. the team’s EFG% vs. the opponent’s EFG%). This is done for EFG, TTP, RebRate, and FTR by taking the difference between the team’s statistic and the paired opponent statistic. I’ve created these fields below to be studied within this project.
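As a minimal sketch of how the difference fields can be built, the snippet below uses the first three Atlanta Hawks rows with the rounded values printed in the structure summary further down, so the results only approximate the stored columns. The subtraction order is inferred from those printed rows, not taken from the original script.

```r
# Toy data: first three Atlanta Hawks rows, rounded as printed by str().
nba <- data.frame(
  EFG      = c(0.480, 0.489, 0.496),
  Opp_EFG  = c(0.498, 0.485, 0.509),
  TTP      = c(0.266, 0.247, 0.219),
  DTTP     = c(0.270, 0.246, 0.226),
  ORebRate = c(0.336, 0.316, 0.353),
  DRebRate = c(0.642, 0.623, 0.625),
  FTR      = c(0.293, 0.250, 0.277),
  OFTR     = c(0.295, 0.249, 0.254)
)

# Each difference is the first column of a pair minus the second
# (order inferred from the printed sample rows; an assumption).
nba$EFG_diff     <- nba$EFG      - nba$Opp_EFG
nba$TTP_diff     <- nba$TTP      - nba$DTTP
nba$RebRate_diff <- nba$ORebRate - nba$DRebRate
nba$FTR_diff     <- nba$FTR      - nba$OFTR

print(round(nba[, c("EFG_diff", "TTP_diff", "RebRate_diff", "FTR_diff")], 3))
```

Note that with this construction RebRate_diff compares a team’s offensive rebounding with its own defensive rebounding, which is why every value comes out negative.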
I’ve outlined all details of the data in the “NBA Data Info” file.
## [1] 1014 18
## [1] "X" "Team" "year" "W.L."
## [5] "Conference" "MadePlayoffs" "EFG" "Opp_EFG"
## [9] "TTP" "DTTP" "ORebRate" "DRebRate"
## [13] "FTR" "OFTR" "EFG_diff" "TTP_diff"
## [17] "RebRate_diff" "FTR_diff"
## 'data.frame': 1014 obs. of 18 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Team : Factor w/ 39 levels "Atlanta Hawks",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1981 1985 1990 1992 2000 2001 2002 2003 2004 2005 ...
## $ W.L. : num 0.378 0.415 0.5 0.463 0.341 0.305 0.402 0.427 0.341 0.159 ...
## $ Conference : Factor w/ 2 levels "East","West": 1 1 1 1 1 1 1 1 1 1 ...
## $ MadePlayoffs: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ EFG : num 0.48 0.489 0.496 0.481 0.46 ...
## $ Opp_EFG : num 0.498 0.485 0.509 0.497 0.481 ...
## $ TTP : num 0.266 0.247 0.219 0.222 0.246 ...
## $ DTTP : num 0.27 0.246 0.226 0.221 0.2 ...
## $ ORebRate : num 0.336 0.316 0.353 0.323 0.301 ...
## $ DRebRate : num 0.642 0.623 0.625 0.654 0.668 ...
## $ FTR : num 0.293 0.25 0.277 0.203 0.217 ...
## $ OFTR : num 0.295 0.249 0.254 0.209 0.196 ...
## $ EFG_diff : num -0.01755 0.00356 -0.01314 -0.0158 -0.02156 ...
## $ TTP_diff : num -0.003762 0.001072 -0.006916 0.000317 0.04618 ...
## $ RebRate_diff: num -0.306 -0.307 -0.272 -0.331 -0.367 ...
## $ FTR_diff : num -0.0017 0.00152 0.02303 -0.00643 0.02137 ...
## [1] "Atlanta Hawks"
## [2] "Boston Celtics"
## [3] "Brooklyn Nets"
## [4] "Charlotte Bobcats"
## [5] "Charlotte Hornets"
## [6] "Chicago Bulls"
## [7] "Cleveland Cavaliers"
## [8] "Dallas Mavericks"
## [9] "Denver Nuggets"
## [10] "Detroit Pistons"
## [11] "Golden State Warriors"
## [12] "Houston Rockets"
## [13] "Indiana Pacers"
## [14] "Kansas City Kings"
## [15] "Los Angeles Clippers"
## [16] "Los Angeles Lakers"
## [17] "Memphis Grizzlies"
## [18] "Miami Heat"
## [19] "Milwaukee Bucks"
## [20] "Minnesota Timberwolves"
## [21] "New Jersey Nets"
## [22] "New Orleans Hornets"
## [23] "New Orleans Pelicans"
## [24] "New Orleans/Oklahoma City Hornets"
## [25] "New York Knicks"
## [26] "Oklahoma City Thunder"
## [27] "Orlando Magic"
## [28] "Philadelphia 76ers"
## [29] "Phoenix Suns"
## [30] "Portland Trail Blazers"
## [31] "Sacramento Kings"
## [32] "San Antonio Spurs"
## [33] "San Diego Clippers"
## [34] "Seattle SuperSonics"
## [35] "Toronto Raptors"
## [36] "Utah Jazz"
## [37] "Vancouver Grizzlies"
## [38] "Washington Bullets"
## [39] "Washington Wizards"
## [1] "No" "Yes"
## [1] "East" "West"
## X Team year
## Min. : 1.0 Atlanta Hawks : 37 Min. :1980
## 1st Qu.: 254.2 Boston Celtics : 37 1st Qu.:1990
## Median : 507.5 Chicago Bulls : 37 Median :1999
## Mean : 507.5 Cleveland Cavaliers: 37 Mean :1999
## 3rd Qu.: 760.8 Denver Nuggets : 37 3rd Qu.:2008
## Max. :1014.0 Detroit Pistons : 37 Max. :2016
## (Other) :792
## W.L. Conference MadePlayoffs EFG
## Min. :0.1060 East:508 No :441 Min. :0.4242
## 1st Qu.:0.3780 West:506 Yes:573 1st Qu.:0.4760
## Median :0.5120 Median :0.4892
## Mean :0.5000 Mean :0.4894
## 3rd Qu.:0.6175 3rd Qu.:0.5012
## Max. :0.8880 Max. :0.5622
##
## Opp_EFG TTP DTTP ORebRate
## Min. :0.4226 Min. :0.1919 Min. :0.1864 Min. :0.1804
## 1st Qu.:0.4769 1st Qu.:0.2255 1st Qu.:0.2251 1st Qu.:0.2549
## Median :0.4900 Median :0.2380 Median :0.2372 Median :0.2813
## Mean :0.4892 Mean :0.2379 Mean :0.2379 Mean :0.2813
## 3rd Qu.:0.5022 3rd Qu.:0.2500 3rd Qu.:0.2497 3rd Qu.:0.3078
## Max. :0.5398 Max. :0.3089 Max. :0.2998 Max. :0.3767
##
## DRebRate FTR OFTR EFG_diff
## Min. :0.5790 Min. :0.1456 Min. :0.1580 Min. :-0.0752389
## 1st Qu.:0.6454 1st Qu.:0.2155 1st Qu.:0.2143 1st Qu.:-0.0176898
## Median :0.6658 Median :0.2341 Median :0.2338 Median :-0.0005686
## Mean :0.6670 Mean :0.2356 Mean :0.2357 Mean : 0.0001438
## 3rd Qu.:0.6896 3rd Qu.:0.2554 3rd Qu.:0.2574 3rd Qu.: 0.0195345
## Max. :0.7503 Max. :0.3344 Max. :0.3466 Max. : 0.0824197
##
## TTP_diff RebRate_diff FTR_diff
## Min. :-7.189e-02 Min. :-0.5482 Min. :-0.1375773
## 1st Qu.:-1.521e-02 1st Qu.:-0.4329 1st Qu.:-0.0235689
## Median :-3.601e-04 Median :-0.3848 Median : 0.0009545
## Mean : 5.797e-05 Mean :-0.3857 Mean :-0.0001364
## 3rd Qu.: 1.522e-02 3rd Qu.:-0.3357 3rd Qu.: 0.0244218
## Max. : 6.364e-02 Max. :-0.2376 Max. : 0.0967220
##
The data set has 1,014 observations of NBA teams across 37 seasons (1980 to 2016). I chose to begin in 1980 because that is the first year there was a 3-point line. Since EFG will become such an important variable (spoiler alert!), I thought it would be better to have 3P Field Goal data in the study. Each line begins with the team’s name (usually a combination of city and mascot), the year in which the team participated in the NBA, and its winning percentage (Wins / Games Played) for that season.
Each observation has variables associated with advanced statistics of basketball performance. Outside of the category fields (i.e. making the playoffs and a team’s conference), the variables come in pairs: how well the team performs in an area and how well the team’s opponents fare in the same area. This includes variables like:
Effective Field Goal % (EFG) & Opponent Effective Field Goal % (Opp_EFG)
Turnovers / Possession (TTP) & Turnovers Caused / Possession (DTTP)
Offensive Rebound Rate (ORebRate) & Defensive Rebound Rate (DRebRate)
Free Throw Rate (FTR) & Opponent’s Free Throw Rate (OFTR)
Since there is such a pairing among 8 of my variables (4 pairs), I’ve also included the difference for each pair to better show how a team performs against its opponents in that skill (e.g. Effective Field Goal % Difference).
My categorical variables include whether a team made the playoffs (the postseason tournament reserved for roughly the top half of teams) and which conference it plays in (East or West).
The main features of the data set are Winning % and the 4 team-vs-opponent difference variables (EFG, TTP, RebRate, and FTR), as I want to expand upon Wayne Winston’s work (he only examined 1 year, using Excel) of predicting winning % from the difference variables.
I think it will also be helpful to look at the individual variables (the non-difference pairs) and the “MadePlayoffs” variable in terms of predicting win percentage. I believe the “MadePlayoffs” variable will be very helpful when graphed to show the direction of a variable’s effect (e.g. a scatter plot with dots colored by whether the team made the playoffs will show which teams are doing better).
I created the difference pair variables. It’s the method Winston determined was the best relationship for measuring wins. The remainder were created by scraping data from the Basketball Reference website (outlined in a separate file).
The mean of W.L. is .500. That looks almost too ideal, but it is expected: every game produces one win and one loss, so winning percentages across the league average out to about .500. I can also see that most of the means and medians are very close in value, which suggests fairly symmetric, “normal”-like graphs. I was able to scrape the data in its current form (it just needed a few clean-up scripts and merging).
The graphs below offer a quick histogram for each variable. I also broke out each variable with “facet_wrap” to show any difference between playoff teams and non-playoff teams. I thought this would help determine whether there are basketball benefits to the variable, since going to the playoffs is seen as a success.
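A histogram pair of this kind could be sketched as follows. This assumes ggplot2 (the text names “facet_wrap”), runs on simulated stand-in data, and the bin width is a guess, so it illustrates the approach and does not reproduce the original plotting code.

```r
set.seed(1)
# Simulated stand-in for the real data; only the column names match the data set.
nba <- data.frame(
  W.L. = c(rnorm(100, 0.36, 0.09), rnorm(100, 0.61, 0.09)),
  MadePlayoffs = factor(rep(c("No", "Yes"), each = 100))
)

# ggplot2 is an assumption here; skip plotting if it is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Overall histogram of the variable.
  p_all <- ggplot(nba, aes(x = W.L.)) +
    geom_histogram(binwidth = 0.025)
  # Same histogram, one panel per playoff status.
  p_split <- p_all + facet_wrap(~ MadePlayoffs, ncol = 1)
  print(p_all)
  print(p_split)
}
```

Stacking the panels vertically (ncol = 1) keeps the x-axes aligned, which makes a shift between the two groups easier to see.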
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1060 0.2930 0.3660 0.3583 0.4390 0.5850
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3660 0.5370 0.6100 0.6091 0.6710 0.8880
The W.L.% has a somewhat normal distribution. It’s also obvious that teams with a higher winning percentage are more likely to be in the playoffs.
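The grouped summaries shown for each variable have the shape of base R’s `by()` output. A small sketch with made-up winning percentages (illustrative values only, not the real data):

```r
# Illustrative values only; the real data set has 1,014 rows.
nba <- data.frame(
  W.L. = c(0.378, 0.415, 0.500, 0.463, 0.610, 0.671, 0.537, 0.720),
  MadePlayoffs = factor(c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"))
)

# Overall five-number summary plus mean.
print(summary(nba$W.L.))

# The same summary, split by playoff status.
print(by(nba$W.L., nba$MadePlayoffs, summary))

# Group means alone, handy for a quick comparison.
group_means <- tapply(nba$W.L., nba$MadePlayoffs, mean)
print(group_means)
```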
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4242 0.4760 0.4892 0.4894 0.5012 0.5622
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4242 0.4693 0.4792 0.4793 0.4913 0.5451
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4355 0.4853 0.4964 0.4971 0.5086 0.5622
The 2nd graph makes me think better-shooting teams are more likely to be in the playoffs. The range seems small at .42 to .56; I figured some teams would be worse.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4226 0.4769 0.4900 0.4892 0.5022 0.5398
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4453 0.4888 0.4995 0.4991 0.5109 0.5398
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4226 0.4706 0.4823 0.4816 0.4930 0.5278
I think the 2nd graph paints a similar picture to before. This time it leads us to think that better defenses will get in the playoffs. Again, the range is quite small (smaller than EFG) at .42 to .54.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0752400 -0.0176900 -0.0005686 0.0001438 0.0195300 0.0824200
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.075240 -0.031800 -0.018650 -0.019820 -0.007291 0.036250
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.740e-02 1.251e-05 1.383e-02 1.551e-02 3.063e-02 8.242e-02
It’s great to see the normal distribution, as that should help with the linear model. There does seem to be a big difference between playoff and non-playoff teams if you look at the graph.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1919 0.2255 0.2380 0.2379 0.2500 0.3089
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2010 0.2336 0.2451 0.2452 0.2562 0.3089
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1919 0.2208 0.2319 0.2324 0.2432 0.2929
Teams should expect to turn the ball over on roughly 19% to 31% of their possessions. The difference between playoff and non-playoff teams doesn’t seem to be as big as it is for EFG.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1864 0.2251 0.2372 0.2379 0.2497 0.2998
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1902 0.2214 0.2313 0.2323 0.2426 0.2928
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1864 0.2293 0.2424 0.2422 0.2540 0.2998
Turnovers caused by the defense range from about .19 to .30, a slightly narrower range than turnovers committed (.19 to .31), even though both average .2379. I think the lower maximum is explainable, as some of the offense’s turnovers are “unforced.”
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.189e-02 -1.521e-02 -3.601e-04 5.797e-05 1.522e-02 6.364e-02
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0417100 0.0003523 0.0128800 0.0128600 0.0247600 0.0636400
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.071890 -0.021630 -0.010150 -0.009796 0.002714 0.044160
Another very “normalish” variable to use for linear model building.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1804 0.2549 0.2813 0.2813 0.3078 0.3767
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1940 0.2528 0.2768 0.2777 0.3038 0.3767
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1804 0.2572 0.2853 0.2841 0.3120 0.3693
Teams can expect to grab about 18% to 38% of the available offensive rebounds. Visually, I’m not sure I can tell a difference between playoff and non-playoff teams for this variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5790 0.6454 0.6658 0.6670 0.6896 0.7503
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5790 0.6435 0.6647 0.6652 0.6879 0.7454
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5889 0.6468 0.6664 0.6685 0.6907 0.7503
Teams appear to grab their defensive rebounds at a much higher rate. The range is about .58 to .75. I have heard of strategies where the offense will not go for rebounds as much but will instead get back early (after the initial shot) to play defense.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5482 -0.4329 -0.3848 -0.3857 -0.3357 -0.2376
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5406 -0.4321 -0.3869 -0.3874 -0.3392 -0.2587
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5482 -0.4334 -0.3826 -0.3843 -0.3333 -0.2376
This variable is much less normal. I almost want to say it has a bi-modal shape to it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1456 0.2155 0.2341 0.2356 0.2554 0.3344
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1456 0.2093 0.2247 0.2282 0.2457 0.3209
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1612 0.2218 0.2407 0.2413 0.2616 0.3344
Teams should expect their “Free Throw Rate” to be between .15 and .33. The IQR is quite small at only about .04.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1580 0.2143 0.2338 0.2357 0.2574 0.3466
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1754 0.2180 0.2394 0.2411 0.2610 0.3466
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1580 0.2105 0.2283 0.2316 0.2519 0.3295
Very similar story with OFTR as compared to FTR.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1376000 -0.0235700 0.0009545 -0.0001364 0.0244200 0.0967200
## nba$MadePlayoffs: No
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.13760 -0.03404 -0.01354 -0.01289 0.01055 0.08248
## --------------------------------------------------------
## nba$MadePlayoffs: Yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.085370 -0.012600 0.006060 0.009676 0.032740 0.096720
Another normal looking variable for the linear model.
With the exception of the rebounding graphs, almost all of our data have a fairly normal look. The rebounding data is still somewhat normal, but I think it almost has a bi-modal distribution. This is really great news, as it allows us to continue with the process of building a model where these variables could predict the win % for a given team.
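A visual impression of normality can be backed up with a quick numeric check. The sketch below uses simulated data only (one bell-shaped sample and one deliberately two-humped sample, both hypothetical) and base R’s `shapiro.test()`; it is not part of the original analysis.

```r
set.seed(42)
# Simulated samples, not the real variables.
x_normal  <- rnorm(1014, mean = 0, sd = 0.03)
x_bimodal <- c(rnorm(507, -0.43, 0.03), rnorm(507, -0.33, 0.03))

# A small mean/median gap indicates symmetry, but a symmetric two-humped
# sample can pass this check too, so it does not prove normality.
gap <- function(x) abs(mean(x) - median(x))
print(c(normal = gap(x_normal), bimodal = gap(x_bimodal)))

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw_normal  <- shapiro.test(x_normal)
sw_bimodal <- shapiro.test(x_bimodal)
print(c(normal = sw_normal$p.value, bimodal = sw_bimodal$p.value))

# A Q-Q plot shows where the sample departs from a normal distribution.
qqnorm(x_bimodal)
qqline(x_bimodal)
```

Keep in mind that linear regression places its normality assumption on the residuals, so slightly non-normal predictors are not by themselves a problem.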
Since I hope to make a Win% Prediction model, I believe my bi-variate analysis and plots should focus on each variable’s relationship with the Win% variable to better understand their importance to the model. I also think I should ensure model viability by making sure the remaining variables are not too correlated with each other.
I found that the “difference” variables of EFG, TTP, and FTR definitely have a relationship with winning %. However, the RebRate difference variable did not have much of a relationship. This is quite surprising as I thought this would be one of the biggest factors of winning basketball games.
It’s very important to note that the four main difference pairs do not have much correlation with each other. Low correlation among predictors matters for the linear regression model, because strongly correlated inputs make the individual coefficients hard to interpret.
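One way to check how correlated the predictors are with each other is a correlation matrix. This sketch uses simulated, independent columns whose names match the difference variables; the spreads are rough guesses from the summaries above, so the numbers are not the real correlations.

```r
set.seed(7)
n <- 200
# Simulated, independent stand-ins for the four difference variables.
diffs <- data.frame(
  EFG_diff     = rnorm(n, 0, 0.03),
  TTP_diff     = rnorm(n, 0, 0.02),
  RebRate_diff = rnorm(n, -0.39, 0.06),
  FTR_diff     = rnorm(n, 0, 0.035)
)

# Pairwise Pearson correlations between the predictors.
cor_mat <- cor(diffs)
print(round(cor_mat, 2))

# Largest absolute correlation between two different predictors.
off_diag <- cor_mat[upper.tri(cor_mat)]
print(max(abs(off_diag)))
```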
As far as I can tell, the EFG difference variable vs. winning % was the strongest relationship, with a correlation of about .85. This makes a lot of sense. It’s easy to say that teams that shoot well and keep opponents from shooting well should win a lot of basketball games.
The first 6 plots below show the strongest relationships with Win%. The narrative makes a lot of sense: if you shoot well, the other team doesn’t shoot well, and you don’t turn the ball over, you are likely to win games.
## [1] 0.6077424
Fairly strong relationship. The more effectively a team shoots, the more games it should win.
## [1] -0.5704137
The variance looks a little larger but there is still a relationship.
## [1] 0.8452894
This is my strongest relationship in the whole study, for the obvious reasons mentioned earlier. Coloring the points by the MadePlayoffs variable helps strengthen the visual of the relationship.
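A scatter plot with playoff coloring, plus the matching correlation, might look like the sketch below. The data are simulated: the slope (4.9) and noise level (0.08) are borrowed from the one-variable model reported later, while the playoff rule (Win% above .500) is a made-up simplification, and ggplot2 is assumed.

```r
set.seed(3)
n <- 200
# Simulated data; slope and noise are borrowed from the m1 model output.
EFG_diff <- rnorm(n, 0, 0.03)
W.L.     <- 0.5 + 4.9 * EFG_diff + rnorm(n, 0, 0.08)
nba <- data.frame(
  EFG_diff, W.L.,
  MadePlayoffs = factor(ifelse(W.L. > 0.5, "Yes", "No"))  # hypothetical rule
)

# Pearson correlation between the predictor and winning %.
r <- cor(nba$EFG_diff, nba$W.L.)
print(r)

# ggplot2 is an assumption; skip plotting if it is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(nba, aes(x = EFG_diff, y = W.L.)) +
    geom_point(aes(color = MadePlayoffs)) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
  print(p)
}
```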
## [1] -0.4608441
## [1] -0.5925833
Another fairly strong relationship that indicates that it is better to create turnovers than commit them.
## [1] 0.4066263
My correlation table shows that there is some correlation for these two variables but it is very difficult to see.
The next two plots of this section show the strongest relationships among non-Win% variables. They both make a lot of sense. In the first example, it’s easy to explain that teams that shoot well don’t have as many opportunities to turn the ball over. They are playing well and probably aren’t as likely to turn the ball over. The second example is a little more difficult to explain, but I could see a situation where teams that go for offensive rebounds are more likely to get fouled (i.e. big guys “hitting the boards” against the defense).
## [1] -0.3973709
Adding the playoff coloring doesn’t clarify the relationship. In general, it’s just not a strong relationship. I put the 3rd graph in to help clear it up. I was hoping to see large circles on the left and small circles on the right, which might say that teams that are effective at shooting aren’t as likely to turn the ball over.
## [1] 0.3602241
The playoffs factor doesn’t seem to have much effect as there are green and red dots throughout the graph. Again, I created the 3rd graph to see if this would help paint the story of teams that are going for offensive rebounds are more likely to get fouled. I wanted to see small dots on the left and large ones on the right but I’m not sure it’s showing that completely.
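The plot below is very difficult to understand, as I don’t know why RebRate_diff isn’t correlated with Winning %. I have always understood that rebounding better than your opponent leads to winning the game. One possible explanation is the way the variable is built: the sample rows above show RebRate_diff equal to ORebRate minus the team’s own DRebRate (0.336 - 0.642 = -0.306 for the first row), so it is negative for every team and may not measure rebounding relative to the opponent.
## RebRate_diff vs W.L.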
## [1] -0.02916389
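As a hedged sketch of an alternative, the rebounding gap could be measured against the opponent instead. This assumes the columns follow the standard definitions, under which the opponent’s offensive rebound rate is 1 minus the team’s defensive rebound rate; `Opp_ORebRate` and `RebRate_diff_alt` are hypothetical names that do not exist in the data set. The values are the rounded first three rows printed earlier.

```r
# Rounded values for the first three rows, as printed by str().
ORebRate <- c(0.336, 0.316, 0.353)
DRebRate <- c(0.642, 0.623, 0.625)

# The variable as it appears in the data set: always negative.
RebRate_diff <- ORebRate - DRebRate

# Assumption: the opponent grabs whatever share of rebounds on the team's
# defensive end the team does not, so its offensive rate is 1 - DRebRate.
Opp_ORebRate     <- 1 - DRebRate             # hypothetical column
RebRate_diff_alt <- ORebRate - Opp_ORebRate  # hypothetical column

print(data.frame(RebRate_diff, Opp_ORebRate, RebRate_diff_alt))
```

Whether this alternative correlates better with Win% would have to be tested on the full data set; nothing here has been checked against it.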
I’ve also included a few categorical variable plots.
##
## Calls:
## m1: lm(formula = W.L. ~ EFG_diff, data = nba)
## m2: lm(formula = W.L. ~ EFG_diff + TTP_diff, data = nba)
## m3: lm(formula = W.L. ~ EFG_diff + TTP_diff + RebRate_diff, data = nba)
## m4: lm(formula = W.L. ~ EFG_diff + TTP_diff + RebRate_diff + FTR_diff,
## data = nba)
##
## ==============================================================
## m1 m2 m3 m4
## --------------------------------------------------------------
## (Intercept) 0.499*** 0.500*** 0.541*** 0.527***
## (0.003) (0.002) (0.013) (0.011)
## EFG_diff 4.890*** 4.242*** 4.270*** 3.880***
## (0.097) (0.076) (0.076) (0.071)
## TTP_diff -2.711*** -2.696*** -2.833***
## (0.096) (0.096) (0.085)
## RebRate_diff 0.107** 0.070*
## (0.033) (0.029)
## FTR_diff 0.861***
## (0.051)
## --------------------------------------------------------------
## R-squared 0.71 0.84 0.84 0.88
## adj. R-squared 0.71 0.84 0.84 0.88
## sigma 0.08 0.06 0.06 0.05
## F 2532.83 2652.41 1788.32 1788.96
## p 0.00 0.00 0.00 0.00
## Log-likelihood 1082.58 1375.90 1381.13 1507.10
## Deviance 7.02 3.94 3.89 3.04
## AIC -2159.15 -2743.81 -2752.26 -3002.19
## BIC -2144.39 -2724.12 -2727.65 -2972.66
## N 1014 1014 1014 1014
## ==============================================================
## [1] 0.04402368
It looks like the model predicts winning % with an absolute error of only about .044, or 4.4 percentage points (roughly 3.6 games over an 82-game season). Not too bad.
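An error figure like this can be computed as the mean absolute residual of the fitted model. The sketch below assumes that is how the number was obtained (the original code is not shown) and uses simulated data generated from the reported m4 coefficients, so its output will not match 0.044 exactly.

```r
set.seed(11)
n <- 300
# Simulated predictors; spreads are rough guesses from the summaries above.
sim <- data.frame(
  EFG_diff     = rnorm(n, 0, 0.03),
  TTP_diff     = rnorm(n, 0, 0.02),
  RebRate_diff = rnorm(n, -0.39, 0.06),
  FTR_diff     = rnorm(n, 0, 0.035)
)
# Response built from the reported m4 coefficients plus noise (sigma = 0.05).
sim$W.L. <- 0.527 + 3.880 * sim$EFG_diff - 2.833 * sim$TTP_diff +
  0.070 * sim$RebRate_diff + 0.861 * sim$FTR_diff + rnorm(n, 0, 0.05)

# Fit the four-variable model.
m4 <- lm(W.L. ~ EFG_diff + TTP_diff + RebRate_diff + FTR_diff, data = sim)

# In-sample mean absolute error and R-squared.
mae <- mean(abs(residuals(m4)))
r2  <- summary(m4)$r.squared
print(c(mae = mae, r.squared = r2))
```

Because the error is measured on the same rows used for fitting, it is likely somewhat optimistic; holding out a few seasons would give a fairer estimate.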
Yes. The model accounts for 88% of the variance in teams’ winning percentage each year, using the 4 difference variables as inputs. It allows one to predict winning % fairly accurately from those four variables. The most important factor, by far, is EFG_diff: using that variable alone, I am able to account for 71% of the variance.
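To make the model concrete, here is a worked prediction using the m4 coefficients from the table above and the first row of the data (the 1981 Atlanta Hawks, with the difference values as printed in the structure summary, so they are rounded). The predicted winning % comes out at about .447 against an actual .378.

```r
# Coefficients of model m4, copied from the regression table.
coefs <- c("(Intercept)" = 0.527, EFG_diff = 3.880, TTP_diff = -2.833,
           RebRate_diff = 0.070, FTR_diff = 0.861)

# First row of the data set (Atlanta Hawks, 1981), rounded as printed.
hawks_1981 <- c(EFG_diff = -0.01755, TTP_diff = -0.003762,
                RebRate_diff = -0.306, FTR_diff = -0.0017)

# Prediction = intercept + sum of coefficient * value for each predictor.
predicted <- coefs[["(Intercept)"]] + sum(coefs[names(hawks_1981)] * hawks_1981)
actual    <- 0.378
abs_error <- abs(predicted - actual)

print(round(c(predicted = predicted, actual = actual, abs_error = abs_error), 3))
```

This single row misses by about .069, which is larger than the overall .044 figure; individual seasons will naturally vary around that average.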
I believe this chart states the most obvious point: if your team wins basketball games, it is more likely to make the playoffs. The colors really complete the picture, as they are clearly separated between left and right. However, I will note there do appear to be instances of losing teams (teams with a Win% below 50%) making the playoffs and winning teams (teams with a Win% above 50%) not making the playoffs.
This chart helps display the strongest variable in the linear regression model. The EFG Difference variable alone explains about 71% of the variance in winning %. I’ve also added some playoff coloring to help show the effectiveness of the variable.
This graph is not critical to exploring the main goal of the project (predicting Win%). However, I did want to investigate any noticeable relationship (in this case, a correlation of about -.4). I was trying to better understand why this relationship is present. My first thought is that teams that shoot well are probably just playing good basketball and not turning the ball over as much. If that is truly the case, then we should see large dots in the lower left and smaller dots in the upper right. I think this happens some, but it is nowhere near airtight.
This study was definitely a learning experience. It began with reading Winston’s chapters on quantifying wins in the NBA, became slightly difficult while scraping the data from Basketball Reference, and then became very tedious while building graph after graph.
In Winston’s chapter on basketball, he only looked at one year’s worth of basketball statistics to calculate wins. I knew I could expand that study to as many years as had relevant data, but I would need to change from wins to winning %, as some years don’t have the same number of games played. From there I really had to dive into the “rvest” package to learn more about web scraping in R (I’ve already completed the Data Wrangling Course and I think it helped a lot). I found that the data was mostly clean except for slight variations that occur over the years at the Basketball Reference site (e.g. some years they label 3P% as 3P; other years it is labeled correctly).
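Label variations like this can be handled by normalizing column names before stacking the seasons. A minimal base R sketch, with made-up tables and values standing in for the scraped data, and `fix_names` as a hypothetical helper (the actual clean-up scripts are not shown):

```r
# Hypothetical per-season tables with inconsistent header labels.
season_a <- data.frame(Team = "Atlanta Hawks",  `3P%` = 0.35, check.names = FALSE)
season_b <- data.frame(Team = "Boston Celtics", `3P`  = 0.36, check.names = FALSE)

# Rename the mislabeled column so every season uses the same header.
fix_names <- function(df) {
  names(df)[names(df) == "3P"] <- "3P%"
  df
}

# Apply the fix to each season, then stack the rows into one table.
seasons  <- lapply(list(season_a, season_b), fix_names)
combined <- do.call(rbind, seasons)
print(combined)
```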
Once I had all my data, I originally only had the mindset of replicating the Winston linear model. I wanted to test the winning percentage model and call it a day. However, I decided to go along with the project outline, and I found myself looking at the data in other ways that led to other questions I could be asking (e.g. why do teams have fewer turnovers when they shoot well?). By the time I reached the conclusion, I didn’t even care about the original model because I wanted to investigate the other questions. I feel the project definitely opened my eyes to how much we can miss if we don’t give exploratory data analysis the attention and detail it deserves.